Import at least pandas, numpy and matplotlib.pyplot

Check where your data is and what the files are called (use !dir)

Read the _machine_2021_programsummary file into a pandas DataFrame called df

Check the first 10 rows of df

List all the columns in df

Check the types of all the columns. Use a for loop to print the following statements.
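A minimal sketch of such a loop. The small DataFrame and the wording of the printed statement are assumptions for illustration only; with the real data you would skip the stand-in and loop over your own df.

```python
import pandas as pd

# Hypothetical stand-in for df; the real column names come from your file.
df = pd.DataFrame({
    "program": ["A11", "A12"],
    "vibration_x": [0.12, 0.34],
    "cycles": [10, 12],
})

# DataFrame.dtypes is a Series indexed by column name,
# so .items() yields (column name, dtype) pairs.
lines = []
for column, dtype in df.dtypes.items():
    lines.append(f"Column {column} is of type {dtype}")

print("\n".join(lines))
```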

Now use DataFrame.info() to get the same information.

Check if there are any NaN values in df

Find the missing values

Get percentage of missing data
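The three checks above can be sketched as follows. The data is made up (one NaN planted on purpose) and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single missing value in row 2.
df = pd.DataFrame({
    "vibration_x": [0.1, 0.2, np.nan, 0.4],
    "vibration_y": [1.0, 1.1, 1.2, 1.3],
})

# True if at least one cell in the whole frame is NaN.
has_nan = df.isna().any().any()

# Rows that contain at least one NaN.
rows_with_nan = df[df.isna().any(axis=1)]

# isna() gives booleans; their mean per column is the missing fraction.
pct_missing = df.isna().mean() * 100

print(has_nan)
print(rows_with_nan)
print(pct_missing)
```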

Drop the row with the missing value

Reset the index of the dataframe (since you removed one row) and drop the old index. View the first 15 rows and see if row #12 is there.
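A small sketch of dropping and re-indexing, on made-up data with a hypothetical column name. The point is that `dropna()` leaves a gap in the index and `reset_index(drop=True)` closes it without keeping the old index as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a missing value.
df = pd.DataFrame({"vibration_x": [0.1, np.nan, 0.3, 0.4]})

df = df.dropna()                # index is now 0, 2, 3 (row 1 is gone)
df = df.reset_index(drop=True)  # index is 0, 1, 2; old index is discarded

print(df.head(15))
```

Without `drop=True` the old index would be kept as a new column called `index`.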

Change the type of 'start' and 'end' to datetime. Show df.info() after the change.
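One way to do the conversion, sketched on made-up timestamps. The timestamp format shown here is an assumption; `pd.to_datetime` will usually infer the format of the real file on its own.

```python
import pandas as pd

# Hypothetical stand-in: timestamps stored as plain strings.
df = pd.DataFrame({
    "start": ["2021-01-01 08:00:00", "2021-01-01 09:30:00"],
    "end": ["2021-01-01 08:45:00", "2021-01-01 10:10:00"],
})

# Convert both columns from strings to datetime64.
for column in ["start", "end"]:
    df[column] = pd.to_datetime(df[column])

df.info()
```

Once the columns are datetimes you can do arithmetic on them, for example `df["end"] - df["start"]` gives the duration of each run.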

Plot values for all the parameters (columns) as a function of the index.

Plot values for all the parameters (columns) as a function of column ['start']

Plot distribution for vibration and temperature parameters

Statistics for temperature look weird, right? We reached out to the client and found out that there was something wrong with the measurement. Let's drop it from now on and focus on the vibration parameters.

Now find all possible combinations of vibration parameters (without repetitions). Print them out. (hint: use two for loops)
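A sketch of the two-loop idea from the hint. The column names are hypothetical; replace them with the vibration columns in your df. Starting the inner loop at `i + 1` is what avoids repetitions and self-pairs.

```python
# Hypothetical vibration column names.
vibration_columns = ["vibration_x", "vibration_y", "vibration_z"]

pairs = []
for i in range(len(vibration_columns)):
    # Start after i so each unordered pair appears exactly once.
    for j in range(i + 1, len(vibration_columns)):
        pairs.append((vibration_columns[i], vibration_columns[j]))

for first, second in pairs:
    print(first, second)
```

The standard library gives the same pairs with `itertools.combinations(vibration_columns, 2)`.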

Plot scatter plots for all of the combinations (use code from previous task)

Seems that we have three clusters. Not sure of that? Let's make some fancy plots with seaborn (jointplot).

Standard jointplot, bins=15

Hexagonal jointplot, bins=15, hex gridsize=12

Didn't we forget about the program? Let's see how many programs we have (value counts)

So there are three programs as well. Try to make scatter plots colored by program name and see if they correspond to the clusters we have found.

Now you can see that some of the points are further away from their groups than it seemed at the beginning (they belong to foreign clusters). How can we see that more clearly? Maybe we should include program in the scatter plots?

How do you plot a categorical (string/text) variable? One idea is to map it to integers.

Create a dictionary to map 'A11' to 1, 'A12' to 2 ...

Map values and show the dataframe.
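A sketch of the mapping step. Here the dictionary is built from the sorted unique program names instead of being typed by hand, and the result goes into a new column called `program_id`; both choices are assumptions made for illustration. The stand-in data uses only the two program names given above.

```python
import pandas as pd

# Hypothetical stand-in for the 'program' column.
df = pd.DataFrame({"program": ["A11", "A12", "A11"]})

# Number the sorted program names starting from 1: 'A11' -> 1, 'A12' -> 2, ...
program_map = {
    name: number
    for number, name in enumerate(sorted(df["program"].unique()), start=1)
}

# Series.map looks every value up in the dictionary.
df["program_id"] = df["program"].map(program_map)

print(program_map)
print(df)
```

Any value missing from the dictionary would become NaN, so it is worth checking for NaN after mapping.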

Make the scatter plots again, include 'program'

Now you can clearly see working points that are far away from the usual working points in a particular program.

Finally, let's calculate LOF.

Import LocalOutlierFactor from sklearn. If you don't have the sklearn package, install it!

Create dataframe with columns you would like to include in training.

Define the classifier object. Have a look at the documentation and check out the description and examples. Define at least three LOF parameters: n_neighbors, novelty and contamination. Read the docs, have a look at your dataset and think of possible good values. Hint: Good values are the ones that give you the result you expect :)

Train the classifier object (fit to training data)

Get the negative outlier factor (check the docs again)
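The define/fit/score steps can be sketched like this. The data is synthetic (one tight cluster plus one planted far-away point), the column names are hypothetical, and the parameter values are only a starting guess, not recommended settings for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic training data: 50 points in a tight cluster and one far-away point.
cluster = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
X_train = pd.DataFrame(
    np.vstack([cluster, [[3.0, 3.0]]]),
    columns=["vibration_x", "vibration_y"],  # hypothetical names
)

# novelty=False means we score the training points themselves.
lof = LocalOutlierFactor(n_neighbors=20, novelty=False, contamination="auto")

# fit_predict fits the model and returns 1 for inliers, -1 for outliers.
labels = lof.fit_predict(X_train)

# Opposite of the LOF: close to -1 for inliers, much lower for outliers.
scores = lof.negative_outlier_factor_

print(labels[-1], scores[-1])
```

Note that with `novelty=False` there is no separate `predict` method; labels for the training data come from `fit_predict`.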

Make the scatter plots again, this time coloring the markers by the predicted value

You can see that points that don't belong to their program cluster were not marked as outliers. Any idea why? Think of the way LOF is calculated. Try to make it better (see better results below).
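One possible direction, sketched on synthetic data: LOF compares a point's local density with that of its neighbors, so a point sitting inside another program's cluster looks perfectly normal when all programs are fitted together. Fitting LOF separately for each program is one way around that. The column names, cluster positions and n_neighbors value below are all assumptions, and other fixes (for example adding the mapped program as a feature) may work too.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Two synthetic program clusters and one stray A11 point inside the A12 cluster.
a = rng.normal(loc=0.0, scale=0.2, size=(40, 2))
b = rng.normal(loc=5.0, scale=0.2, size=(40, 2))
stray = np.array([[5.0, 5.0]])

features = ["vibration_x", "vibration_y"]  # hypothetical names
df = pd.DataFrame(np.vstack([a, stray, b]), columns=features)
df["program"] = ["A11"] * 41 + ["A12"] * 40

# Global fit: the stray point has plenty of close neighbors from A12.
global_lof = LocalOutlierFactor(n_neighbors=20)
global_lof.fit(df[features])
df["lof_global"] = global_lof.negative_outlier_factor_

# Per-program fit: each point is compared only with points of its own program.
df["lof_per_program"] = np.nan
for name, group in df.groupby("program"):
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(group[features])
    df.loc[group.index, "lof_per_program"] = lof.negative_outlier_factor_

stray_row = df.iloc[40]
print(stray_row[["lof_global", "lof_per_program"]])
```

In this toy setup the stray point gets a far more negative score in the per-program fit than in the global one.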